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Abstract — Generalized autoregressive conditional heteroscedasticity 
(GARCH) models have long been considered as one of the most 
successful families of approaches for volatility modeling in financial 
return series. In this paper, we propose an alternative approach based 
on methodologies widely used in the field of statistical machine learning. 
Specifically, we propose a novel nonparametric Bayesian mixture of 
Gaussian process regression models, each component of which models 
the noise variance process that contaminates the observed data as a 
separate latent Gaussian process driven by the observed data. This 
way we essentially obtain a mixture Gaussian process conditional 
heteroscedasticity (MGPCH) model for volatility modeling in financial 
return series. We impose a nonparametric prior with power-law nature 
over the distribution of the model mixture components, namely the 
Pitman-Yor process prior, to allow for better capturing modeled data 
distributions with heavy tails and skewness. Finally, we provide a copula- 
based approach for obtaining a predictive posterior for the covariances 
over the asset returns modeled by means of a postulated MGPCH 
model. We evaluate the efficacy of our approach in a number of 
benchmark scenarios, and compare its performance to state-of-the-art 
methodologies. 

Index Terms — Gaussian process, Pitman-Yor process, mixture model, 
conditional heteroscedasticity copula, volatility modeling. 



1 Introduction 

Statistical modeling of asset values in financial markets 
requires taking into account the tendency of assets to- 
wards asymmetric temporal dependence [IJ. Besides, the 
data generation processes of the returns of financial mar- 
ket indexes may be non-linear, non-stationary and /or 
heavy-tailed, while the marginal distributions may be 
asymmetric, leptokurtic and /or show conditional het- 
eroscedasticity. Hence, there is a need to construct flexi- 
ble models capable of incorporating these features. The 
generalized autoregressive conditional heteroscedasticity 
(GARCH) family of models has been used to address 
conditional heteroscedasticity and excess kurtosis (see, 
e.g., 0, 131). 

The time-dependent variance in series of returns on 
prices, also known as volatility, is of particular interest in 
finance, as it impacts the pricing of financial instruments, 
and it is a key concept in market regulation. GARCH 
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approaches are commonly employed in modeling fi- 
nancial return series that exhibit time-varying volatility 
clustering, i.e. periods of swings followed by periods of 
relative calm, and have been shown to yield excellent 
performance in these applications, consistently defining 
the state-of-the-art in the field in the last decade. GARCH 
models represent the variance by a function of the past 
squared returns and the past variances, which facilitates 
model estimation and computation of the prediction 
errors. 

Gaussian process (GP) models comprise one of the 
most popular Bayesian methods in the field of ma- 
chine learning for regression, function approximation, 
and predictive density estimation [4J. Despite their sig- 
nificant flexibility and success in many application do- 
mains, GPs do also suffer from several limitations. In 
particular, GP models are faced with difficulties when 
dealing with tasks entailing non-stationary covariance 
functions, multi-modal output, or discontinuities. Sev- 
eral approaches that entail using ensembles of fractional 
GP models defined on subsets of the input space have 
been proposed as a means of resolving these issues (see, 

e.g., m, m, 0). 

In this work, we propose a novel GP-based approach 
for volatility modeling in financial time series (return) 
data. Our proposed approach provides a viable alterna- 
tive to GARCH models, that allows for effectively cap- 
turing the clustering effects in the variability or volatility. 
Our approach is based on the introduction of a novel 
nonparametric Bayesian mixture model, the component 
distributions of which constitute GP regression models; 
the noise variance processes of the model component 
GPs are considered as input-dependent latent variable 
processes which are also modeled by imposition of 
appropriate GP priors. This way, our novel approach al- 
lows for learning both the observation-dependent nature 
of asset volatility, as well as the imderlying volatility 
clustering mechanism, modeled as a latent model com- 
ponent switching procedure. We dub our approach the 
mixture Gaussian process conditional heteroscedasticity 
(MGPCH) model. 

Nonparametric Bayesian modeling techniques, espe- 
cially Dirichlet process mixture (DPM) models, have 
become very popular in statistics over the last few years, 
for performing nonparametric density estimation (81, El, 
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IITOl . Briefly, a realization of a DPM can be seen as an 
infinite mixture of distributions with given parametric 
shape (e.g., Gaussian). An interesting alternative to the 
Dirichlet process prior for nonparametric Bayesian mod- 
eling is the Pitman- Yor process prior Pitman- Yor 
processes produce a small number of sparsely populated 
clusters and a large number of highly populated clusters 
[12]. Indeed, the Pitman- Yor process prior can be viewed 
as a generalization of the Dirichlet process prior, and 
reduces to it for a specific selection of its parameter 
values. Consequently, the Pitman- Yor process turns out 
to be more promising as a means of modeling complex 
real-life datasets that usually comprise a high number 
of clusters which comprise only few data points, and a 
low number of clusters which are highly frequent, thus 
dominating the entire population. 

Inspired by these advances, the component switching 
mechanism of our model is obtained by means of a 
Pitman- Yor process prior imposed over the component 
GP latent allocation variables of our model. We derive 
a computationally efficient inference algorithm for our 
model based on the variational Bayesian framework, 
and obtain the predictive density of our model using an 
approximation technique. We examine the efficacy of our 
approach considering volatility prediction in a number 
of financial return series. 

The remainder of this paper is organized as follows: 
In Section 2, we provide a brief presentation of the 
theoretical background of the proposed method. Initially, 
we present the Pitman- Yor process and its function as a 
prior in nonparametric Bayesian models; further, we pro- 
vide a brief summary of Gaussian process regression. In 
Section 3, we introduce the proposed mixture Gaussian 
process conditional heteroscedasticity (MGPCH) model, 
and derive efficient model inference algorithms based on 
the variational Bayesian framework. We also propose a 
copula-based method for learning the interdependencies 
between the returns of multiple assets jointly modeled 
by means of an MGPCH model. In Section 4, we conduct 
the experimental evaluation of our proposed model, con- 
sidering a number of applications dealing with volatility 
modeling in financial return series. In the final section, 
we summarize and discuss our results. 



2 Preliminaries 

2.1 The Pitman-Yor process 

Dirichlet process models were first introduced by Fer- 
guson ||T3l . A DP is characterized by a base distribution 
Go and a positive scalar a, usually referred to as the 
innovation parameter, and is denoted as DP{a,Go). Es- 
sentially, a DP is a distribution placed over a distribution. 
Let us suppose we randomly draw a sample distribution 
G from a DP, and, subsequently, we independently draw 
M random variables {Qm}m=i from G: 



\G ^ G, 



.M 



(2) 



Integrating out G, the joint distribution of the variables 
{®m}m=i <^^ri be shown to exhibit a clustering effect. 
Specifically, given the first M-l samples of G, {Qm}m=i' 
it can be shown that a new sample Q*j^j is either (a) 
drawn from the base distribution Gq with probability 
a+M-i ' °^ selected from the existing draws, ac- 

cording to a multinomial allocation, with probabilities 
proportional to the number of the previous draws with 
the same allocation [ill. Let {Qc}c=i be the set of distinct 
values taken by the variables {Q*n}m=i- Denoting as 



,A/-1 



the number of values in {Qm}m=i that equal to 
6c, the distribution of Q*J^.J given {0*n}m=i can be shown 
to be of the form [14] 



P(0Ml{0m}m=l '".Go) =- 



-f M - 1 
C 



-Gn 



,M-1 



+ 



c=l 



M-1 



(3) 



where Sq^ denotes the distribution concentrated at a 
single point 8c. These results illustrate two key prop- 
erties of the DP scheme. First, the innovation parameter 
a plays a key-role in determining the number of distinct 
parameter values. A larger a induces a higher tendency 
of drawing new parameters from the base distribution 
Go; indeed, as a — > oo we get G — Go. On the contrary, 
as a ^ all {6*„}m=i tend to cluster to a single random 
variable. Second, the more often a parameter is shared, 
the more likely it will be shared in the future. 

The Pitman-Yor process (PYP) [IT) functions similar to 
the Dirichlet process. Let us suppose we randomly draw 
a sample distribution G from a PYP, and, subsequently, 
we independently draw AI random variables {©*„} 
from G: 



M 



with 



G|,5,a,Go -PY(^,a,Go) 



e;jG-G, m = l,...M 



(4) 



(5) 



where 5 e [0, 1) is the discount parameter of the Pitman- 
Yor process, a > —J is its innovation parameter, and Go 
the base distribution. Integrating out G, similar to Eq. 
(3), we now yield 

a + M — i 

(6) 



E 

c=l 



,J\/-1 



M-1 



G|a,Go ~DP(a,Go) 



(1) 



As we observe, the PYP yields an expression for 
P(0Ml{Qm}m=i.G'o) quite similar to that of the DP, 
also possessing the rich-gets-richer clustering property, 
i.e., the more samples have been assigned to a draw 
from Go, the more likely subsequent samples will be 
assigned to the same draw. Further, the more we draw 
from Go, the more likely a new sample will again be 
assigned to a new draw from Go. These two effects 
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together produce a power-law distribution where many 
unique 8*^ values are observed, most of them rarely 
IITlJ , thus allowing for better modeling observations with 
heavy-tailed distributions. In particular, for S > 0, the 
number of unique values scales as 0{aM^), where M 
is the total number of draws. Note also that, for 6 — 0, 
the Pitman- Yor process reduces to the Dirichlet process, 
in which case the number of unique values grows more 
slowly at 0{alogM) |T2) . 

A characterization of the (unconditional) distribu- 
tion of the random variable G drawn from a PYP, 
PY((5, a, Go), is provided by the stick-breaking construc- 
tion of Sethuraman fl5l . Consider two infinite collections 
of independent random variables v = (wc)^i, {©dJ^Li/ 
where the Vc are drawn from a Beta distribution, and the 
0c are independently drawn from the base distribution 
Gq. The stick-breaking representation of G is then given 

by na 



where 



p{vc) = Beta(l — S,a + Sc) 

c-l 



(7) 



(8) 



and 



(10) 



Under the stick-breakrng representation of the Pitman- 
Yor process, the atoms 9c, drawn independently from 
the base distribution Go, can be seen as the parameters 
of the component distributions of a mixture model com- 
prising an unbounded number of component densities, 
with mixing proportions -ujc^v). 

2.2 Gaussian process models 

Let us consider an observation space X. A Gaussian 
process f{x), x e X, is defined as a collection of random 
variables, any finite number of which have a joint Gaus- 
sian distribution 1161 . A Gaussian process is completely 
specified by its mean function and covariance function. 
We define the mean function m{x) and the covariance 
function k{x, x') of a real process f{x) as 



7n{x) =nf{x)] 
k{x, x') ^nUix) ~ m{x)){f{x') - m{x'))] 

and we will write the Gaussian process as 

f{x) ^ Af{m{x),k{x,x)) 



(11) 



(12) 



Usually, for notational simplicity, and without any loss of 
generality, the mean of the process is taken to be zero, 
m{x) = 0, although this is not necessary. Concerning 
selection of the covariance function, a large variety of 
kernel functions k{x,x') might be employed, depending 



on the application considered ||T6l . This way, a postu- 
lated Gaussian process eventually takes the form 



f{x)r^J\f{0,k{x,x)). 



(13) 



Let us suppose a set of independent and identically 
distributed (i.i.d.) samples V = {{xi,yi)\i = 1,...,A^}, 
with the d-dimensional variables Xi being the obser- 
vations related to a modeled phenomenon, and the 
scalars yi being the associated target values. The goal 
of a regression model is, given a new observation a;*, 
to predict the corresponding target value y,, based on 
the information contained in the training set V. The 
basic notion behind Gaussian process regression consists 
in the assumption that the observable (training) target 
values y in a considered regression problem can be 
expressed as the superposition of a Gaussian process 
over the input space X, f{x), and an independent white 
Gaussian noise 

y = fix) + e (14) 



where f{x) is given by (12), and 

e - A/'(0,cr2) 



(15) 



Under this regard, the joint normality of the training 
(9) target values y = [yi]fLi ^i^d some unknown target 
value 2/*, approximated by the value /* of the postulated 
Gaussian process evaluated at the observation point a;*, 
yields |16| 



y 



where 



K{X,X)+a^lN k{x^) 

x^ 



k{xf) = [k{xi,x^), . . . , k{xN,x^)]^ 



(16) 
(17) 



X = {xi}f^^. In is the N x N identity matrix, and K 
is the matrix of the covariances between the N training 
data points (design matrix), i.e. 



K{X,X) 



k{xi,xi) k{xi,X2). 

k{x2,Xi) k{x2,X2). 
k{xN,Xi) k{xN,X2) 



k{xi,XN) 
k{x2,XN) 



k{xN,XN) 



(18) 

Then, from (16), and conditioning on the available 
training samples, we can derive the expression of the 
model predictive distribution, yielding 

p{f4x,,v)^^^{f4^l,,al) (19) 

pi'i^^k{x,)'^{K{X,X) + a^lN)-^y (20) 

crj = CT^ - fe(a;,)^ {K{X, X) + a'^InY^ k{x^) + k{x^,x^) 

(21) 

Regarding optimization of the hyperparameters of the 
employed covariance function (kernel), say 6, and the 
noise variance cr^ of a GP model, this is usually con- 
ducted by type-II maximum likelihood, that is by max- 
imization of the model marginal likelihood (evidence). 
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Using (16), it is easy to show that the evidence of the 
GP regression model yields 



\ogp{y\X;e,a^) 



N 



log27r - -log|K(X,X) + CT^j 



N\ 



-y^(K{X,X) +a^I 



N 



y 



(22) 

It is interesting to note that the GP regression model 
considers that the noise that contaminates the modeled 
output variables does not depend on the observations 
themselves, but rather that it constitutes an additive 
white noise term with constant variance, which bears 
no correlation between observations, and no dependence 
on the values of the observations. Nevertheless, in many 
real-world applications, with financial return series mod- 
eling being a characteristic example, this assumption of 
constant noise variance is clearly implausible. 

To ameliorate this issue, an heteroscedastic GP regres- 
sion approach was proposed in |[T7|. where the noise 
variance is considered to be a function of the observed 
data, similar to previously proposed heteroscedastic regres- 
sion approaches applied to econometrics and statistical 
finance, e.g., |18|, |19|. A key drawback of the approach 
of [17 \ is that their heteroscedastic regression approach 
does not allow for capturing the clustering effects in 
the variability or volatility, which is apparent in the 
vast majority of financial return series data, and is effec- 
tively captured by GARCH-type models. Our approach 
addresses these issues under a nonparametric Bayesian 
mixture modeling scheme, as discussed next. 

3 Proposed Approach 

In this section, we first define the proposed MGPCH 
model, considering a generic modeling problem that 
comprises the input variables x G MP, and the output 
variables y G M^. Further, we derive an efficient in- 
ference algorithm for our model under the variational 
Bayesian inference paradigm, and we obtain the expres- 
sion of its predictive density. Finally, we show how we 
can obtain a predictive distribution for the covariances 
between the modeled output variables {yi}^i, by uti- 
lization of the statistical tool of copulas. 

3.1 Model definition 

Let fdix) be a latent function modeling the dth output 
variable yd as a function of the model input x. We 
consider that the expression of yd as a function of x 
is not uniquely described by the latent function fd{x), 
but fd{x) is only an instance of the (possibly infinite) 
set of possible latent functions fd{x), c — 1, . . . ,oo. To 
determine the association between input samples and 
latent functions, we impose a Pitman- Yor process prior 
over this set of functions. In addition, we consider that 
each one of these latent functions /^(a;) has a prior 
distribution of the form of a Gaussian process over the 
whole space of input variables a; G M^. At this point, we 
make a further key-assumption: We assume that the noise 



variance of the postulated GPs is not a constant, but 
rather that it varies with the input variables a; G M^. In other 
words, we consider the noise variance as a latent process, 
different for each model output variable, and exhibiting 
a clustering effect, as described by the dynamics of the 
postulated PYP mixing prior. 

Let us consider a set of input /output observation 
pairs {xn,yn}n=i' comprising N samples. Let us also 
introduce the set of variables {znc}n'^i' with z„c = 1 
if the function relating a;„ to y„ is considered to be 
expressed by the set {fd{x)}d=i postulated Gaussian 
processes, z„c — otherwise. Then, based on the previ- 
ous description, we essentially postulate the following 
model: 

D 

p{yJXn,Znc = 1) = Y[Af{ynd\fd{Xn),<JdiXnf) (23) 



d=l 



with 



p{Znc = l\v) = roc(l') 
c-1 

UJc{v) = VcY[{l - V,) G [0,1] 
J = l 

OO 

'^ujciv) = 1 

c=l 

p{vc) ~ Beta(l — (5, a + Sc) 



and 



(24) 
(25) 

(26) 
(27) 

(28) 



p{f^d\X)=md\0,K^{X,X)) 

where ynd is the dth element of y„, we define X = 
{xn}^^,, Y ^ {yj^^„ and Z ^ KJ^''^,, ^ is the 
vector of the fd{xn) Vn, i.e., fd = [fd{xn)]n=i' and 
K'^{X,X) is the following design matrix 

k''{xi,Xi) k''{xi,X2) ■ ■ ■ k''{xi,XN) 

k''{x2,Xi) k''{X2,X2) ■ . ■ k''{x2,XN) 



K^iX^X) ^ 



k''{xN,Xi) k''{xN,X2) 



k''{xN,XN) 

(29)" 

Regarding the latent processes <T^(a;„)^, we choose 
to also impose a GP prior over them. Specifically, to 
accommodate the fact that a^ixn)^ > (by definition), 
we postulate 

'^diXnf 



exp[.9d(a;n)] 



and 



p(gS|X)-A^(gSKl,A^(X,X)) 



(30) 
(31) 

where is the vector of the gd{xn) Vn, i.e., = 
[9d(.Xn)]n=i' ^^'^ ^'^i^,^) is a design matrix, similar to 
K'^{X, X), but with (possibly) different kernel functions 

Finally, due to the effect of the innovation parameter 
a on the number of effective mixture components, we 
also impose a Gamma prior over it: 

p{a) ^ g{a\T]i,T]2). (32) 

This completes the definition of our proposed MGPCH 
model. 



5 



3.2 Inference algorithm 

Inference for nonparametric models can be conducted 
under a Bayesian setting, typically by means of varia- 
tional Bayes (e.g., [20J), or Monte Carlo techniques (e.g., 
ETI ). Here, we prefer a variational Bayesian approach, 
due to its considerably better scalability in terms of com- 
putational costs, which becomes of major importance 
when having to deal with large data corpora I22l , |23| . 

Our variational Bayesian inference algorithm for the 
MGPCH model comprises derivation of a family of 
variational posterior distributions q{.) which approxi- 
mate the true posterior distribution over the infinite 
sets Z, V = {v,)Zi, {flZu and {g'}Zi> and the 
innovation parameter a. Apparently, Bayesian inference 
is not tractable imder this setting, since we are dealing 
with an infinite number of parameters. 

For this reason, we employ a common strategy in 
the literature of Bayesian nonparametrics, formulated 
on the basis of a truncated stick-breaking representation 
of the PYP (20). That is, we fix a value C and we let 
the variational posterior over the Vi have the property 
q{vc = !) = !. In other words, we set rudv) equal to 
zero for c > C. Note that, under this setting, the treated 
MGPCH model involves a full PYP prior; truncation 
is not imposed on the model itself, but only on the 
variational distribution to allow for tractable inference. 
Hence, the truncation level C is a variational parameter 
which can be freely set, and not part of the prior model 
specification. 

Let W = {v,a,Z,{f}^^^,{g''}^^^} be the set of 
all the parameters of the MGPCH model over which 
a prior distribution has been imposed, and S be the 
set of the hyperparameters of the model priors and 
kernel functions. Variational Bayesian inference intro- 
duces an arbitrary distribution q{W) to approximate the 
actual posterior p{W\'B., X,Y) which is computationally 
intractable, yielding 



\0gp{X,Y)^C{q) + KL{q\\p) 



(33) 



where 



£{q) = / dWq{W)\0g P^^'J;^^^'^ (34) 

q{W) 



and KL(g||p) stands for the Kullback-Leibler (KL) diver- 
gence between the (approximate) variational posterior, 
q{W), and the actual posterior, p{W\E,X,Y). Since KL 
divergence is nonnegative, C{q) forms a strict lower 
bound of the log evidence, and would become exact if 
q{W) = p{W\'E., X,Y). Hence, by maximizing this lower 
bound C{q) (variational free energy) so that it becomes 
as tight as possible, not only do we minimize the KL- 
divergence between the true and the variational poste- 
rior, but we also implicitly integrate out the unknowns 
W. 

Due to the considered conjugate exponential prior 
configuration of the MGPCH model, the variational pos- 
terior q{W) is expected to take the same functional form 



as the prior, p{W) 

/c-i \ c D 

q{W) ^q{Z)q{a) J] ^(^c) H 11 « ifd) Q iOd) (35) 

\c=l / c=l d=l 

with 

N C 

= n n «('^"- ^ 1) (36) 



n— 1 c— 1 



Then, the variational free energy of the model reads 
(ignoring constant terms) 



C D 



=1^1^ d/rf(?(/rf)l0g —pr- 

l:i:/<w,«i).og5<?5i*5i.A-K-'^)) 

— 1 .1 — 1 



c=l d=l 

+ j dag(a)<j loi 

c-i 



p{a\ili,m) 



q{a) 

/ d«eg(^^c log 



C N 



d=l 



p{Znc = l\v) 



■EE.(-«c = i)|yd.,(.)iog 

c— 1 n—1 ^ J. \ / 

E// drMqifdH9d)logpiynd\rd{Xn),a^diXnf) 

(37) 



Derivation of the variational posterior distribution 
q{W) involves maximization of the variational free en- 
ergy C{q) over each one of the factors of q{W) in turn, 
holding the others fixed, in an iterative manner |26|. 
By construction, this iterative, consecutive updating of 
the variational posterior distribution is guaranteed to 
monotonically and maximally increase the free energy 
C{q) L25J. 

Let us denote as (.) the posterior expectation of a 
quantity. From (37), we obtain the following variational 
(approximate) posteriors over the parameters of our 
model: 

1. Regarding the PYP stick-breaking variables Vc, we 
have 

q{v,)^Beta{vc\Pc,i,Pc,2) (38) 



where 



N 



Pel = 1 - 5 + E li'^^c = 1) 



(39) 



n=l 
C N 



c'=c+l n=l 

2. The innovation parameter a yields 

q{a) = g{a\m,fi2) (41) 

where 

m = m + c-i (42) 
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C-1 



= m-Y. [^(^--2) - V'(/3c,i + Pea)] (43) 



'0(.) denotes the Digamma function, and 



(a) 



(44) 



3. Regarding the posteriors over the latent functions /^j, 
we have 



where 



Srf " dia; 



1 



N 



(45) 

(46) 
(47) 

(48) 



and y^i ^ [ynd]n=i- 

4. Similar, regarding the posteriors over the latent noise 
variance processes g^, we have 



where 



5:^^((A^(x,x))-Vgs)"' 



(49) 
(50) 



m^, = A^{X, X) ( - -diag [q (z„, = 1)]^^ J 1 + m^l 

(51) 

and QJ^ is a positive semi-definite diagonal matrix, 
whose components comprise variational parameters that 
can be freely set. Note that, from this result, it follows 



(a:^(a:„)2^=cxp([ma„-i[Sa, 



(52) 



5. Finally, the posteriors over the latent variables Z yield 

g(z„c = 1) oc exp ((logti7c(v))) exp (r„c) (53) 

where 

c-l 

{\ogw,{v)) = (log(l - V,,)) + {\ogv,) (54) 



and 



D 



(55) 

where [|]„ is the nth element of vector ^, [Srf]„„ is the 
(n, n)th element of S^, and it holds 

(log«,) = 7^(/3,a) - ^(/3c,i + /3c,2) (56) 

(log(l - V,)) - V(/3c,2) - V-C/^ca + /3c,2) (57) 

As a final note, estimates of the values of the model hy- 
perparameters set S, which comprises the hyperparam- 
eters of the model priors and the kernel functions fc(-, ■) 



and A(-, •), are obtained by maximization of the model 
variational free energy C{q) over each one of them. For 
this purpose, in this paper we resort to utilization of the 
limited memory variant of the BFGS algorithm (L-BFGS) 



3.3 Predictive density 

Let us consider the predictive distribution of the dth 
model output variable corresponding to a;*. To obtain 
it, we begin by deriving the predictive posterior distri- 
bution over the latent variables /. Following the relevant 
derivations of Section 2.2, we have 



where 



(58) 



= k^x^y [K^iX, X) + (BS)-^ j (59) 

[al^f = -k^ix^ f {k^{X,X) + [B^-'Y' k^x^) 

I Ay (^x ^ ^ X ^'j 

(60) 

(61) 
(62) 



Pel + Pc 



and 



k{x^) = [k{xi,x^),...,k{xN,x^ 



(63) 



Further, we proceed to the predictive posterior distri- 
bution over the latent variables g; we yield 



where 



(64) 



<, = A=(a;.)^ (Q^--)l + ™5 (65) 



^td - X'{x^,x.f - A^(^*)^ (A5 + iQdr') A=(^*) 

(66) 

and 

\{x^) = [X{xi,x^), X{xn, a;*)]^ (67) 

Based on these results, the predictive posterior of our 
model output variables yields 

c 



j J^{y*d ^{TUc{v))al^, 



<i{y*ci) = \ M\y*d 
c 



c=l 
2 \_c \2 



(68) 
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We note that this expression does not yield a Gaussian 
predictive posterior. However, it is rather straightfor- 
ward to compute the predictive means and variances of 
It holds 

c 

y*d = E [y*d \x^]V\ = ^ (tuc (v) ) a^^ (69) 



and 



c=l 



(70) 



where 



^ E[exp(5S,)|a^,; P] = exp + (71) 

3.4 Learning the covariances between the modeled 
output variables 

As one can observe from (23), a characteristic of our 
proposed MGPCH model is its assumption that the 
distribution of the modeled output vectors y G MP fac- 
torizes over their component variables {yd]d=i- Indeed, 
this type of modeling is largely the norm in Gaussian 
process-based modeling approaches I.16J . This construc- 
tion in essence implies that, under our approach, the 
modeled output variables are considered independent, 
i.e. their covariance is always assumed to be zero. How- 
ever, when jointly modeling the return series of various 
assets, the modeled output variables (asset returns) are 
rather strongly correlated, and it is desired to be capable 
of predicting the values of their covariances for any 
given input value. 

Existing approaches for resolving these issues of GP- 
based models are based on the introduction of an addi- 
tional kernel-based modeling mechanism that allows for 
capturing this latent covariance structure |28|, |29|, |30l, 
||3T1 , |32| . For example, in 128| the authors propose uti- 
lization of a convolution process to induce correlations 
between two output components. In 131], a generaliza- 
tion of the previous method is proposed for the case of 
more than two modeled outputs combined under a con- 
volved kernel. Along the same lines, multitask learning 
approaches for resolving these issues are presented in 
129 i and [30], where separate GPs are postulated for each 
output, and are considered to share the same prior in the 
context of a multitask learning framework. 

A drawback of the aforementioned existing ap- 
proaches is that, in all cases, learning entails employing 
a tedious optimization procedure to estimate a large 
number of hyperparameters of the used kernel functions. 
As expected, such a procedure is, indeed, highly prone 
to getting trapped to bad local optima, a fact that might 
severely undermine model performance. 

In this work, to avoid being confronted with such 
optimization issues, and inspired by the financial re- 
search literature, we devise a novel way of capturing 
the interdependencies between the modeled output vari- 
ables {yd]d=i' expressed in the form of their covariances: 



specifically, we use the statistical tool of copulas |[33]. The 
copula, introduced in the seminal work of Sklar [33], 
is a model of statistical dependence between random 
variables. A copula is defined as a multivariate distri- 
bution with standard uniform marginal distributions, 
or, alternatively, as a function (with some restrictions 
mentioned for example in Il34l ) that maps values from 
the unit hypercube to values in the unit interval. 

3.4. 1 Copulas: An introduction 

Let y = [yd]d=i t>e a /^-dimensional random vari- 
able with joint cumulative distribution function (cdf) 
Fi[yd]d=i)' and marginal cdf's Fdiyd), d = 1,...,D, 
respectively. Then, according to Sklar 's theorem, there 
exists a D-variate copula cdf C {■,... ,■) on [0, 1]^ such 
that 

Fiyu...,yD)^C{FM,...,FD{yD)) (72) 

for any y G M^. Additionally, if the marginals 
Fdi'), d ~ 1,...,D, are continuous, then the D-variate 
copula C(-,...,-) satisfying (72) is unique. Conversely, 
if C (■,...,■) is a D-dimensional copula and Fi{-), i = 
1, . . . , D, are univariate cdf's, it holds 



C {ui, ...,ud)=F {F,-\u,), F-\ud)) 



(73) 



where F, ^ 



^ (•) denotes the inverse of the cdf of the dth 
marginal distribution Fd{-), i.e. the quantile function of 
the dth modeled variable yd- 

It is easy to show that the corresponding probability 
density function of the copula model, widely known as 
the copula density function, is given by 



c(wi,...,u_d) 



dui . . . duD 
dui . . . duD 



C(mi,...,md) 
F{F,-\u,),... 



^FoHud)) 



p{F,-Hui),...,FE'{ud)) 



no 



(74) 

where pi (•) is the probability density function of the ith 
component variable yi. 

Let us now assume a parametric class for the copula 
C(-,...,-) and the marginal cdf's Fi{-), i — 1,...,D, 
respectively. In particular, let ( denote the (trainable) 
parameter (or set of parameters) of the postulated cop- 
ula. Then, the joint probability density of the modeled 
variables y = [yi]fLi yields 



piyi,- ■ ■ ,yD\0 



D 



Y[piiyi) 



(75) 



Since the emergence of the concept of copula, several 
copula families have been constructed, e.g., Gaussian, 
Clayton, Frank, Gumbel, Joe, etc, that enable capturing of 
any form of dependence structure. By coupling different 
marginal distributions with different copula functions, 
copula-based models are able to model a wide variety 
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of marginal behaviors (such as skewness and fat tails), 
and dependence properties (such as clusters, positive or 
negative tail dependence) |34|. Selection of the best-fit 
copula has been a topic of rigorous research efforts dur- 
ing the last years, and motivating results have already 
been achieved |35] (for excellent and detailed discussions 
on copulas, c.f. |34| , (36l ). 

3.4.2 Proposed Approach 

In this work, to capture the interdependencies (covari- 
ances) between the MGPCH-modeled output variables, 
we propose a conditional copula-hased dependence mod- 
eling framework. Specifically, for the considered D- 
dimensional output vectors y = [yd]d=i> we postulate 
pairwise parametric conditional models for each output pair 
(.yi^yj)Fj=i,i^j' '^ith cdf's defined as follows: 

Fiy,,y,\x) = C{F,{y,\x),F,{yj\x)\x) (76) 

where the marginals Fd{yd\x) are the cdf's that cor- 
respond to the predictive posteriors q{y*d) given by 
(68), and the used input-conditional copulas are defined 
under a parametric construction as 

C{ui,Uj\x) = C{ui,Uj\Cij{x)) (77) 

and we consider that the Qj {x) are given by 

QA=^) ^ ^{l^Ax)) (78) 

where the jijix) are trainable real-valued models, and 
^(•) is a link fimction ensuring that the values of Qij{x) 
will always be within the range allowed by the copula 
model employed each time. For instance, if a Clayton 
copula C(-) is employed, it is required that its parameter 
be positive, i.e. Qj [x) > [34J; in such a case, ^(■) may be 
defined as the exponential fimction, i.e. ^(a) — exp(Q). 

Note that the predictive posteriors q{y*d) are difficult 
to compute analytically, since (68) does not yield a 
Gaussian distribution. For this reason, and in order to 
facilitate efficient training of the postulated pairwise con- 
ditional copula models, in the following we approximate 
(68) as a Gaussian with mean and variance given by (69) 
and (70), respectively. 

Further, we consider the functions jij{x) to be linear 
basis functions models. Specifically, we postulate 

-f,j{x) ^ wf^hix) (79) 

where the Wij are trainable model parameters, and the 
basis functions h{x) are defined using a small set of basis 
input observations {xi}f^^, and an appropriate kernel 
function k: 

h{x)^[k{x,x,)]i^i (80) 

Training for the postulated pairwise conditional copula 
models can be performed by optimizing the logarithm 
of the copula density function that corresponds to the 



parametric conditional model (77), given a set of training 
data V = {xmyn)n=i' which yields 

N 

= X! ^ {P^iym\Xn),Fj{ynj\x.n)\^ {wfjh{x,i))) , 
n=l 

(81) 

with respect to the parameter vectors Wij. To effect 
this procedure, in this paper we resort to the L-BFGS 
algorithm [27J. 

After training the postulated pairwise models 
C{ui,Uj\Cij{x)) Vi 7^ j, computation of the predictive 
covariance V[y,i,y*j|a;,;X'] between the ith and the jth 
model output given the input observation a;* can be 
conducted using the corresponding conditional copula 
model and marginal predictive densities. Specifically, 
from Hoeffding's lemma IST], ||38l, EH, we directly 
obtain [Eq. (82)]; this latter integral can be approximated 
by means of numerical analysis methods. 

4 Experimental Evaluation 

In this section, we elaborate on the application of our 
MGPCH approach to volatility modeling for financial 
return series data. We perform an experimental eval- 
uation of its performance in volatility modeling, and 
examine how it compares to state-of-the-art competitors. 
We also assess the efficacy of the proposed copula- 
based approach for learning the predictive covariances 
between the modeled output variables of the MGPCH 
model. 

For this purpose, we consider modeling the daily 
return series of various financial indices, including cur- 
rency exchange rates, global large-cap equity indices, 
and Euribor rates. We note that, in this work, asset return 
r{t) is defined as the difference between the logarithm 
of the prices p{t) in two subsequent time points, i.e., 
r{t) = \ogp{t) — \ogp{t — 1). All our source codes were 
developed in MATLAB R2012a. 

4.1 Volatility prediction using the MGPCH model 

In this set of experiments, we consider three application 
scenarios: 

• In the first scenario, we model the return series 
pertaining to the following currency exchange rates, 
over the period December 31, 1979 to December 31, 
1998 (daily closing prices): 

1. (AUD) Australian Dollar / US $ 

2. (GBP) UK Pound / US $ 

3. (CAD) Canadian Dollar / US $ 

4. (DKK) Danish Krone / US $ 

5. (FRF) French Franc / US $ 

6. (DEM) German Mark / US $ 

7. (JPY) Japanese Yen / US $ 

8. (CHF) Swiss Franc / US $. 

* In the second scenario, we model the return series 
pertaining to the following global large-cap equity 
indices, for the business days over the period April 
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V[y«,2/*j|a;*;2?] 



J J [C{F,{K\x,),Fj{K'\x^)\^{wJjh{x,))) - F,{k\x^)Fj{k'\x^)] dudn' 



(82) 



27, 1993 to July 14, 2003 (daily closing prices): 

1. (TSX) Canadian TSX Composite 

2. (CAC) French CAC 40 

3. (DAX) German DAX 

4. (NIK) Japanese Nikkei 225 

5. (FTSE) UK FTSE 100 

6. (SP) US S&P 500. 

• Finally, in the third scenario, we model the return 
series pertaining to the following seven global large- 
cap equity indices and Euribor rates, for the business 
days over the period February 7, 2001 to April 24, 
2006 (daily closing prices for the first 6 indices, and 
annual percentage rate converted to daily effective 
yield for the last index): 

1. (TSX) Canadian TSX Composite 

2. (CAC) French CAC 40 

3. (DAX) German DAX 

4. (NIK) Japanese Nikkei 225 

5. (FTSE) UK FTSE 100 

6. (SP) US S&P 500 

7. (EB3M) Three-month Euribor rate. 

These series have become standard benchmarks for as- 
sessing the performance of volatility prediction algo- 
rithms |40J, im, II421 . 

In all the considered scenarios, the proposed MGPCH 
model is trained using as input data, x{t), vectors con- 
taining the daily returns of all the assets considered in 
each scenario. The corresponding training output data 
y{t) essentially comprise the same series of input vectors 
shifted one-step ahead. In other words, the output series 
are defined as y(t) = r(t + 1), t > 0, and the input series 
as x{t) = r{t), t < T, where T is the total duration of 
the modeled return series, and the vectors r{t) contain 
the return values of all the considered indices at time t. 

In our experiments, we evaluate the MGPCH model 
using zero kernels for the mean process, i.e. k^{x,x') = 
Vc; this construction allows for our model to remain 
consistent with the existing literature, where it is typi- 
cally considered that the modeled return series constitute 
a zero-mean noise-orJy process, i.e. f^{x) = Vd, c. 
Note though that our approach can seamlessly deal with 
learning the mean process ffi{x), if a model for its co- 
variance is available. Further, we consider autoregressive 
kernels of order one for the noise variance process of the 
model, of the form 



X^ix^x') 



4 



(83) 



where the ((> and CTq are model hyperparameters, esti- 
mated by means of free energy optimization (using the 
L-BFGS algorithm). 

To obtain some comparative results, we also eval- 
uate: (i) a common baseline approach from the field 



of financial engineering and econometrics, namely the 
GARCH(1,1) model 13|, that is a GARCH model with 
volatility terms of order one and residual terms of order 
one; and (ii) the recently proposed VHGP approach of 
IITTI . Both these approaches have been shown to be very 
competitive in the task of volatility prediction in finan- 
cial return series |43l, flT]. Note that the GARCH(1,1) 
model uses as input the time variable, while the VHGP 
model is trained similar to MGPCH. 

In our experiments, similar to [41], all the evaluated 
methods are trained using a rolling window of the previ- 
ous 120 days of returns to make 1, 7, and 30 days ahead 
volatility forecasts; we retrain the models every 7 days. 
We use two performance metrics to evaluate the consid- 
ered algorithms: The first one is the mean squared error 
(MSE) between the model-estimated volatilities and the 
squared returns of the modeled return series. The second 
one is the MSE between the generated predictions and 
the historical volatilities computed over rolling windows 
of 10 contiguous return values (days). As discussed 
in |j44), these two groundtruth measurements (squared 
returns and historical volatilities) constitute two of the 
few consistent ways of volatility measuring. 

In Tables 1-3, we provide the obtained results for 
the three considered scenarios. These results are means 
of the obtained MSEs over all the assets modeled in 
each scenario. As we observe, our approach yields a 
clear advantage and a significant improvement over its 
competitors, of at least one order of magnitude, in all 
the considered scenarios, in terms of both the employed 
evaluation metrics. 

4.2 Copula-based modeling of the covariances be- 
tween asset returns 

Here, we evaluate the performance of the proposed 
copula-based approach for learning a predictive model 
of the covariances between the MGPCH-modeled asset 
returns. For this purpose, we repeat the previous ex- 
perimental scenarios, with the goal now being to obtain 
predictions regarding the covariances between the assets 
modeled each time. 

In our experiments, we consider application of three 
popular Archimedean copula types, namely Clayton, 
Frank, and Gumbel copulas |34|. The employed MGPCH 
models are trained similar to the previous experiments. 
The postulated conditional-copula pairwise models use 
a basis set of input observations (to compute the h{x) 
in (80)) that comprises the 10% of the available training 
data points, i.e. 12 data points sampled at regular time 
intervals (one sample every 10 days). 

To obtain some comparative results, we also evaluate 
the performance of two state-of-the-art methods used 
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Table 1 

First Scenario: Obtained MSEs considering comparison against botin tine squared returns and historical volatility. 



Evaluation Metric 


Squared Returns 


Historical Volatility 


Prediction Horizon 


1-step 


7-step 


30-step 


Average 


1-step 


7-step 


30-step 


Average 


GARCH 


4.97x10^' 


4.99x10-' 


5.11x10-' 


5.03x10-' 


4.98x10-' 


5.01x10-' 


5.08x10-' 


5.02x10-' 


VHGP 


2.13xl0-» 


2.15xl0-» 


2.16xl0-» 


2.15x10-" 


1.63xl0-» 


1.63x10-" 


1.62x10-" 


1.63x10-" 


MGPCH 


1.46x10"'' 


1.46x10"'' 


1.48x10-'' 


1.47x10-" 


1.03x10-'-' 


1.02x10"^ 


1.02x10-^ 


1.03x10-^ 



Table 2 

Second Scenario: Obtained MSEs considering comparison against both the squared returns and historical volatility. 



Evaluation Metric 


Squared Returns 


Historical Volatility 


Prediction Horizon 


1-step 


7-step 


30-step 


Average 


1-step 


7-step 


30-step 


Average 


GARCH 


9.47x10-'^ 


9.56x10-'^ 


9.96x10-'^ 


9.66x10-'^ 


9.99x10-" 


1.00x10-"' 


1.03x10-'' 


1.01x10-'' 


VHGP 


3.40x10-' 


3.42x10-' 


3.53x10-' 


3.45x10-' 


4.10x10-' 


4.06x10-' 


3.98x10-' 


4.05x10"' 


MGPCH 


1.37x10-' 


1.39x10-' 


1.45x10-' 


1.41x10-' 


3.52x10-" 


3.52x10-" 


3.47x10-" 


3.50x10-" 



Table 3 

Third Scenario: Obtained MSEs considering comparison against both the squared returns and historical volatility. 



Evaluation Metric 


Squared Returns 


Historical Volatility 


Prediction Horizon 


1-step 


7-step 


30-step 


Average 


1-step 


7-step 


30-step 


Average 


GARCH 


3.91x10-' 


4.14x10-' 


4.94x10-' 


4.33x10-' 


4.50x10-' 


4.61x10-' 


5.06x10-' 


4.72x10-' 


VHGP 


3.94x10-' 


4.03x10-' 


4.26x10-' 


4.08x10-' 


4.87x10-' 


4.89x10-' 


4.91x10-' 


4.88x10-' 


MGPCH 


1.44x10-' 


1.45x10-' 


1.52x10-' 


1.47x10-' 


4.36x10-" 


4.42x10-" 


4.42x10-" 


4.40x10-" 



for modelmg dynamic covariance matrices (multivariate 
volatility) for high-dimensional vector-valued observa- 
tions; specifically, we consider the CCC-MVGARCH(1,1) 
approach of \45j, and the GARCH-BEKK(1,1) method of 
Il46l . As our evaluation metric, we use the products of 
the returns of the corresponding asset pairs at each time 
point. Our obtained results are depicted in Tables 4-6. 
We observe that our approach yields a very competitive 
result: specifically, in two out of the three considered 
scenarios, the yielded improvement was equal to or 
exceeded one order of magnitude, while, in one case, 
all methods yielded comparable results. We also observe 
that switching the employed Archimedean copula type 
had only marginal effects on model performance, in all 
our experiments. 

5 Conclusions 

In this paper, we proposed a novel nonparamet- 
ric Bayesian approach for modeling conditional het- 
eroscedasticity in financial return series. Our approach 
consists in the postulation of a mixture of Gaussian 
process regression models, each component of which 
models the noise variance process that contaminates 
the observed data as a separate latent Gaussian process 
driven by the observed data. We imposed a nonpara- 
metric prior with power-law nature over the distribution 
of the model mixture components, namely the Pitman- 
Yor process prior, to allow for better capturing modeled 
data distributions with heavy tails and skewness. In ad- 
dition, in order to provide a predictive posterior for the 
covariances over the modeled asset returns, we devised 
a copula-based covariance modeling procedure built on 
top of our model. To assess the efficacy of our approach. 



we applied it to several asset return series, and compared 
its performance to several state-of-the-art methods in the 
field, on the grounds of standard evaluation metrics. As 
we observed, our approach yields a clear performance 
improvement over its competitors in all the considered 
scenarios. 
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Table 4 

First Scenario: Obtained MSEs considering comparison against tine asset pair return products. 



Evaluation Metric 


Squared Returns 


Prediction Horizon 


1-step 


7-step 


30-step 


Average 


CCC-MVGARCH 


1.19x10"' 


1.18x10"' 


1.20x10"' 


1.20x10"' 


BEKK 
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1.17x10"' 


1.16x10"' 


MGPCH: Clayton 
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1.17x10"'' 


1.17x10-' 


1.17x10-' 


MGPCH: Frank 


1.16x10"' 
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1.17x10"' 


MGPCH: Gumbel 


1.16x10"' 


1.16x10"' 


1.16x10"' 


1.16x10"' 



Table 5 

Second Scenario: Obtained MSEs considering comparison against the asset pair return products. 



Evaluation Metric 


Squared Returns 


Prediction Horizon 


1-step 


7-step 


30-step 


Average 
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1.4x10-" 
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1.4x10-" 


1.4x10-" 


BEKK 


1.7x10-" 
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1.7x10-" 


1.7x10"" 


MGPCH: Clayton 


3.1x10"' 


3.1x10-' 


3.1x10-' 


3.1x10"' 


MGPCH: Frank 


3.2x10"' 


3.2x10-' 


3.2x10"' 


3.2x10"' 


MGPCH: Gumbel 


3.1x10"' 


3.1x10"' 


3.1x10"' 


3.1x10"' 



Table 6 

Third Scenario: Obtained MSEs considering comparison against the asset pair return products. 



Evaluation Metric 


Squared Returns 


Prediction Horizon 


1-step 


7-step 


30-step 


Average 


CCC-MVGARCH 


0.0044 


0.0045 


0.0047 


0.0045 


BEKK 


0.84 


0.85 


0.85 


0.85 


MGPCH: Clayton 


9.81x10 ' 


9.81x10 


9.85x10 


9.83x10 


MGPCH: Frank 


9.82x10"" 


9.81x10"" 


9.82x10"" 


9.82x10"" 


MGPCH: Gumbel 


9.81x10"" 


9.82x10"" 


9.82x10"" 


9.82x10"" 



[12] Y. W. Teh, "A hierarchical Bayesian language model based on 
Pitman-Yor processes," in Proc. Association for Computational Lin- 
guistics, 2006, pp. 985-992. 

[13] T. Ferguson, "A Bayesian analysis of some nonparamettic prob- 
lems," The Annals of Statistics, vol. 1, pp. 209-230, 1973. 

[14] D. Blackwell and J. MacQueen, "Ferguson distributions via P61ya 
urn schemes," The Annals of Statistics, vol. 1, no. 2, pp. 353-355, 
1973. 

[15] J. Sethuraman, "A constructive definition of the Dirichlet prior," 

Statistica Sinica, vol. 2, pp. 639-650, 1994. 
[16] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for 

Machine Learning. MIT Press, 2006. 
[17] M. LSzaro-Gredilla and M. Tsitsias, "Variational heteroscedastic 

Gaussian process regression," in Proc. 28ih International Conference 

on Machine Learning, BeUevue, WA, USA, 2011. 
[18] K. Kersting, C. Plagemarm, P. Pfaff, and W. Burgard, "Most likely 

heteroscedastic Gaussian processes regression," in Proc. of the 

ICML, 2007, pp. 393-400. 
[19] C. Brooks, S. Burke, and G. Persand, "Benchmarks and the 

accuracy of GARCH model estimation," International Journal of 

Forecasting, vol. 17, pp. 45-56, 2001. 
[20] D. M. Blei and M. I. Jordan, "Variational inference for Dirichlet 

process mixtures," Bayesian Analysis, vol. 1, no. 1, pp. 121-144, 

2006. 

[21] Y. Qi, J. W. Paisley, and L. Carin, "Music analysis using hidden 
Markov mixture models," IEEE Transactions on Signal Processing, 
vol. 55, no. 11, pp. 5209-5224, 2007. 

[22] W. Fan, N. Bouguila, and D. Ziou, "Variational learning for finite 
Dirichlet mixture models and applications," IEEE Transactions on 
Neural Networks and Learning Systems, vol. 23, no. 5, pp. 762 - 774, 
2012. 

[23] A. Penalver and F. Escolano, "Entropy-based incremental varia- 
tional Bayes learning of Gaussian mixtures," IEEE Transactions on 
Neural Networks and Learning Systems, vol. 23, no. 3, pp. 534 - 540, 
2012. 

[24] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An intro- 



duction to variational methods for graphical models," in Learning 
in Graphical Models, M. Jordan, Ed. Dordrecht: Kluwer, 1998, pp. 
105-162. 

[25] S. Chatzis, D. Kosmopoulos, and T. Varvarigou, "Signal modeling 
and classification using a robust latent space model based on t 
distributions," ZEE£ Trans. Signal Processing, vol. 56, no. 3, pp. 
949-963, March 2008. 

[26] D. Chandler, Introduction to Modern Statistical Mechanics. New 
York: Oxford University Press, 1987. 

[27] D. Liu and J. Nocedal, "On the limited memory method for large 
scale optimization," Mathematical Programming B, vol. 45, no. 3, 
pp. 503-528, 1989. 

[28] P. Boyle and M. Frean, "Dependent Gaussian processes," in Proc. 
Neural Information Processing Systems, 2005, pp. 217-224. 

[29] K. Yu, V. Tresp, and A. Schwaighofer, "Learning Gaussian pro- 
cesses from multiple tasks," in Proc. of the 22nd Int'l Conf. on 
Machine learning (ICML), 2005, pp. 1012-1019. 

[30] E. V. Bonilla, K. M. A. Chai, and C. K. 1. Williams, "Multi- 
task Gaussian process prediction," in Proc. Neural Information 
Processing Systems, 2008. 

[31] M. Alvarez and N. Lawrence, "Sparse convolved multiple output 
gaussian processes," in Proc. Neural Information Processing Systems, 
2008, pp. 57-64. 

[32] L. Bo and C. Sminchisescu, "Twin Gaussian processes for struc- 
tured prediction," Int. J. of Comput. Vision, vol. 87, no. 1-2, pp. 
28-52, 2010. 

[33] A. Sklar, "Functions de repartition h n dimensions et luers 
marges," Publications de I'Insiitut de Statistique de I'Universite de 
Paris, vol. 8, pp. 229-231, 1959. 

[34] R. Nelsen, An Introduction to Copulas, set. Springer Series in 
Statistics. New York: Springer, 2006. 

[35] F. Abegaz and U. Naik-Nimbalkar, "Modeling statistical depen- 
dence of Markov chains via copula models," Journal of Statistical 
Planning and Inference, vol. 138, pp. 1131-1146, 2008. 

[36] H. Joe, Multivariate Models and Dependence Concepts, ser. Mono- 



12 



graphs in Statistics and Applied Probability. London: Chapman 
and Hall, 1997, vol. 73. 

[37] W. Hoeffding, "Masstabinvariante korrelations theorie," Schriften 
des Mathematischen Instituts und des Instituts filr Angewandte Math- 
ematik der Universitat Berlin, vol. 5, no. 3, pp. 179-233, 1940. 

[38] H. Block and Z. Fang, "A multivariate extension of Hoeffding's 
lemma," The Annals of Pmhahility, vol. 16, pp. 1803-1820, 1988. 

[39] C. M. Cuadras, "Correspondence analysis and diagonal expan- 
sions in terms of distribution functions," Journal of Statistical 
Planning and Inference, vol. 103, pp. 137-150, 2002. 

[40] B. McCuUough and C. Renfro, "Benchmarks and software stan- 
dards: A case study of GARCH procedures," Journal of Economic 
and Social Measurement, vol. 25, pp. 59-71, 1998. 

[41] A. G. Wilson and Z. Ghahramani, "Copula processes," in Advances 
in Neural Information Processing Systems, 2010. 

[42] C. Brooks, S. Burke, and G. Persand, "Benchmarks and the 
accuracy of GARCH model estimation," International Journal of 
Forecasting, vol. 17, pp. 45-56, 2001. 

[43] P. R. Hansen and A. Lunde, "A forecast comparison of volatility 
models: Does anything beat a GARCH(1,1)," Journal of Applied 
Econometrics, vol. 20, no. 7, pp. 873-889, 2005. 

[44] C. T. Brownlees, R. F. Engle, and B. T. Kelly, "A practical guide 
to volatility forecasting through calm and storm," 2009, available 
at SSRN: http://ssrn.com/abstract=1502915 

[45] R. F. Engle, "Dynamic conditional correlation: a simple class of 
multivariate generalized autoregressive conditional heteroskedas- 
ticity models," Journal of Business & Economic Statistics, vol. 20, 
no. 3, pp. 339-350, 2002. 

[46] R. F. Engle and K. F. Kroner, "Multivariate simultaneous general- 
ized ARCH," Econometric Theory, vol. 11, pp. 122-150, 1995. 



